Enable running PyTorch models #207

masahi · 2024-02-14T05:55:06Z

This is the first cut toward running PT models under mlc-serve. Things are functional but rough. To run a PT model, all we need to do is replace --local-id llama-2-13b-chat-hf-q0f16 with --local-id models/llama-2-13b-chat-hf (the path to the HF model directory), and likewise for other models.

I haven't verified that it supports all the sampler features that we recently added (logprobs, JSON, etc.). Parallel sampling works, but evicting a parallel-sampling request requires a model change.
Right now it only supports vLLM models. They are loaded directly via imports such as from vllm.model_executor.models.llama import LlamaForCausalLM. Models such as phi-2 and Qwen run out of the box, in addition to the llama-based models and Mixtral that the TVM model also supports. vLLM needs to be installed from https://github.com/octoml/vllm/tree/for-mlc-serve
Single-GPU perf is not good. For llama 13B, using benchmark_throughput.py, TVM gets 22.98 requests/s while PT gets 13.78 requests/s (about 60% of TVM's throughput). Since the TVM and vLLM models should be using the same kernels for matmul and attention, there shouldn't be a big difference in performance. I haven't looked into this issue deeply.
Multi-GPU performance is terrible. Currently there is a huge overhead in serializing inputs from the main process to the worker processes, visible as a large poll between two decode steps.

@sunggg @yelite @elvin-n @vvchernov @jroesch @binarybana

serve/mlc_serve/model/model_common.py

serve/mlc_serve/engine/engine_common.py

serve/mlc_serve/model/model_common.py

serve/mlc_serve/model/sampler.py

sunggg

Great work, Masa! A couple comments.

serve/mlc_serve/engine/base.py

serve/mlc_serve/engine/engine_common.py

sunggg · 2024-02-16T18:03:47Z

serve/mlc_serve/model/model_common.py

            block_tables.append(block_table.get_blocks())

            if sliding_window:
                seq_lens.append(min(seq_len, sliding_window))
            else:
                seq_lens.append(seq_len)

+            max_context_len = max(max_context_len, seq_lens[-1])
+
+    def _do_pad(


Now that we have started considering vLLM's tensor layout, what do you think about unifying it? It seems like upstream mlc-llm also uses 2D inputs.

And could this also help our CUDA graph integration?

We haven't verified whether 2D inputs are better for performance, or how much CUDA graph actually helps.

The upstream input looks 2D, but its shape is always either (1, num_total_token) or (batch_size, 1). So their 2D input is essentially 1D.

Yeah, I think it is worth revisiting, imo. But not now, in the future. Although there might not be a performance boost, it would be nice to unify the layout with upstream unless there is a reason not to.
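
For readers following this thread, the two layouts can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in model_common.py: the helper names clamp_seq_lens, flatten_1d and pad_to_2d, and the pad value 0, are hypothetical. The first helper mirrors the hunk above (clamp each sequence length to sliding_window, track max_context_len); the other two contrast a flat 1D token layout with a padded 2D one.

```python
from typing import List, Optional, Tuple


def clamp_seq_lens(
    raw_seq_lens: List[int], sliding_window: Optional[int]
) -> Tuple[List[int], int]:
    """Clamp each length to the sliding window and track the longest one."""
    seq_lens: List[int] = []
    max_context_len = 0
    for seq_len in raw_seq_lens:
        if sliding_window:
            seq_lens.append(min(seq_len, sliding_window))
        else:
            seq_lens.append(seq_len)
        max_context_len = max(max_context_len, seq_lens[-1])
    return seq_lens, max_context_len


def flatten_1d(token_ids: List[List[int]]) -> List[int]:
    """Flat 1D layout: all sequences concatenated, no padding."""
    return [tok for seq in token_ids for tok in seq]


def pad_to_2d(token_ids: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Padded 2D layout: every row padded to the longest sequence.

    Assumes a non-empty batch; pad_id=0 is a placeholder, not a real pad token.
    """
    max_len = max(len(seq) for seq in token_ids)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in token_ids]


batch = [[1, 2, 3], [4], [5, 6]]
flat = flatten_1d(batch)    # 6 token slots
padded = pad_to_2d(batch)   # 3 rows x 3 columns = 9 token slots
```

The padded form spends slots on padding whenever sequence lengths differ, which is the trade-off being weighed against layout unification here.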

serve/mlc_serve/utils.py

serve/pyproject.toml

serve/mlc_serve/model/torch_model.py

masahi · 2024-02-20T20:54:17Z

Oh, I just realized that memory profiling has been broken since I updated our vLLM work, which made their input representation 2D.

Right now we only have max_num_batched_tokens as a parameter to memory profiling, and "the biggest input" to memory profiling is a batch of max_num_batched_tokens sequences, each of length 1. But with 2D input, we also need a max_seq_len param, since prefill inputs are padded to the max seq len in a batch. As a result, memory profiling severely underestimates memory usage.

We could look into how vLLM does memory profiling currently, but I'm inclined to revert their 2D representation change instead. Either way, this is left for future work.
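
To make the underestimate concrete, here is a back-of-the-envelope sketch. It counts input token slots only (not bytes or activations), and the function names and batch sizes are hypothetical, chosen purely for illustration; only max_num_batched_tokens and the idea of a max_seq_len bound come from the comment above.

```python
from typing import List


def profiled_slots_1d(max_num_batched_tokens: int) -> int:
    """Slots seen by profiling today: that many sequences of length 1."""
    return max_num_batched_tokens


def padded_slots_2d(seq_lens: List[int]) -> int:
    """Slots in a padded 2D prefill batch: rows x longest sequence."""
    return len(seq_lens) * max(seq_lens)


max_num_batched_tokens = 4096

# A prefill batch that fits the token budget exactly:
# one long prompt plus sixteen short ones (3072 + 16 * 64 = 4096 tokens).
seq_lens = [3072] + [64] * 16
real_tokens = sum(seq_lens)

slots_profiled = profiled_slots_1d(max_num_batched_tokens)  # 4096
slots_padded = padded_slots_2d(seq_lens)                    # 17 * 3072 = 52224
```

In this made-up batch the padded input is 12.75 times larger than what the profiler sized for, which is why a bound on the maximum sequence length would be needed alongside max_num_batched_tokens.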

masahi · 2024-02-21T00:21:19Z

I opened a ticket to track the many TODO items left after this PR: #217

I think this is ready to merge for now @sunggg

* Prioritize arch-specific lib * Avoid repetitive dlopen when reloading the same model * handle the case where lib path remains the same

masahi added 30 commits January 11, 2024 19:31

refactor to separate TVM specific bits from paged_cache_model

12ce0a3

fix

7a84f15

Remove engine config change for now

f454b7b

make mypy happy with TextGenerator impl by Model

afde741

stub

c49ef45

wip

d9ac72f

wip

acbf825

wip

fef750f

PT model memory profiling works

25a567e

get rid of vllm prepare_inputs

3d06f68

wip

3cafc8b

model runs but nan output

34f77ef

mypy improvement

afb4d4f

runs e2e but the result is garbage

e7212a5

working

f27e3b3

minor

2316e37

do sampling by mlc function

9b985e8

Merge branch 'batch-serving' into pt-model

f2dcc48

merge fix

4d73e63

wip parallel sampling

15a0d3b

fix test

959019d

wip

b6050d9

fix

ff8eb27

wip

8696df5

wip

0af3a70

wip

90ffccd

attach cache_blocks to model

32686d8

change get_num_cache_blocks signature

de2631b

wip

618ca62

wip

9ce2f47

masahi added 9 commits February 13, 2024 20:56

Properly verify sampling params in api handler

f128fe6

Create model artifact config before module initialization

568583a

fix engine start

762012d

Merge branch 'sampling-params-init-fix' into pt-model

72f3707

Merge branch 'batch-serving' into pt-model

3128329

fix

dc5fb6e

black

ebe0b4e

properly handle import failure

4b2de70

add titoken dep

f09d458

masahi added 5 commits February 14, 2024 18:59

revert logprob change

c9ac5ba

restored tokenizer.is_fast assert but commented out

f1cf274

Merge branch 'batch-serving' into pt-model

eaa53a7

fix vocab siz

1336fb8

properly account for logits storage in memory profiling

6186ef2

sunggg reviewed Feb 16, 2024

masahi added 5 commits February 20, 2024 19:53

Merge branch 'batch-serving' into pt-model

2229324

merge fix

aa4d477

validate num_shards in engine creation

8bb96ed

replace print with structlog

cf0813d

add peak memory log for tvm as well

f716851

masahi mentioned this pull request Feb 21, 2024

[Tracking] PT model support follow up #217

Open

8 tasks

add tokenizer.is_fast warning on creation

992b1a0

sunggg merged commit a377c3c into octoml:batch-serving Feb 22, 2024
1 check passed
